
Week 2 lecture notes

Basic architectural mechanisms involved in computing embeddings in Transformer language models (with a specific focus on BERT, as studied in Tenney, Das, & Pavlick, 2019)

Mathematical preliminaries

Breaking down the general idea behind input transformations for Transformer language models (especially BERT, à la ✏️Week 3 lecture notes)

Assume a sentence such as, “The quick brown ___”. What word goes in the blank? The goal of next-word prediction (NWP) models is to correctly predict the next word, and the goal of masked language models (MLMs) is to predict a deliberately hidden (“masked”) word; both are typically trained by minimizing cross-entropy loss.
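To make the training objective concrete, here is a minimal sketch, assuming a made-up toy vocabulary and made-up model probabilities, of the cross-entropy loss for a single prediction:

```python
import numpy as np

# A toy sketch of the MLM/NWP training signal: the model outputs a probability
# distribution over the vocabulary for the blank/masked position, and training
# minimizes the cross-entropy of the correct word. Vocabulary and probabilities
# here are made up for illustration.
vocab = ["the", "quick", "brown", "fox", "dog", "[MASK]"]
target_word = "fox"                      # the word hidden behind the blank
target_index = vocab.index(target_word)

# Hypothetical model output: a softmax distribution over the toy vocabulary.
predicted_probs = np.array([0.05, 0.05, 0.10, 0.60, 0.15, 0.05])

# Cross-entropy loss for this single prediction is -log p(correct word).
loss = -np.log(predicted_probs[target_index])
print(f"cross-entropy loss: {loss:.3f}")  # 0 only if the model puts p=1 on "fox"
```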

Tokenization

What are the tokens in the following sentence: “The quick brown [MASK]”?

A simple assumption is that each “word” is a token, e.g., a Python list of strings like ['the', 'quick', 'brown', '[MASK]']. This is roughly the output you would get from any standard tokenization library, such as nltk, spaCy, or stanza (though such tools do not know that [MASK] is a single special token).
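As a rough illustration (not the exact behavior of any particular library), a whitespace-based tokenizer produces this kind of list:

```python
# A rough sketch of word-level tokenization via a simple whitespace split.
# Real libraries (nltk, spaCy, stanza) apply more careful rules for punctuation,
# clitics, etc.
sentence = "The quick brown [MASK]"
tokens = [t if t == "[MASK]" else t.lower() for t in sentence.split()]
print(tokens)  # ['the', 'quick', 'brown', '[MASK]']
```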

Converting human-readable tokens into neural language model tokens

Recall that one prominent bag-of-words (BOW) representation of a document (e.g., a sequence of words, a sentence, a paragraph, and so on) is a single vector in which each of the $|V|$ dimensions corresponds to one term in the vocabulary $V$. The dimensions are “fixed” and can be thought of as requiring some hashmap or dictionary structure that takes a string (e.g., “quick”) and returns an index (e.g., some integer value like 1000). The tokenization process in neural language models canonically relies on transforming each string input (e.g., a single word) into its one-hot encoded representation: a large, extremely sparse vector with a single 1 at the index corresponding to that string.
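A minimal sketch of this dictionary-plus-one-hot idea, assuming a toy five-entry vocabulary (real vocabularies contain tens of thousands of entries):

```python
import numpy as np

# A minimal sketch of the string -> index -> one-hot pipeline over a toy
# vocabulary.
vocab = {"the": 0, "quick": 1, "brown": 2, "[MASK]": 3, "fox": 4}

def one_hot(token: str, vocab: dict) -> np.ndarray:
    """Return a |V|-dimensional vector with a single 1 at the token's index."""
    vec = np.zeros(len(vocab))
    vec[vocab[token]] = 1.0
    return vec

print(one_hot("quick", vocab))  # [0. 1. 0. 0. 0.]
```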

When this is done for all of the tokens, we obtain a tensor containing an ordered set of vectors, one per token, which we can call $\mathbf{T}$. This tensor has shape $k \times |V|$. Typically, the maximum sequence length $k$ is bounded by hardware constraints; for example, $k = 512$ tokens was a common limit for models like BERT, in part reflecting what accelerators such as TPUs circa 2018 could handle. As technology advances, the number of tokens that can be ingested simultaneously may change (in practice, it has generally increased).
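Continuing the toy example above, a sketch of stacking the one-hot vectors into $\mathbf{T}$:

```python
import numpy as np

# A sketch of building T (shape k x |V|) for "The quick brown [MASK]", assuming
# the same toy vocabulary as above; np.eye(|V|)[i] is the one-hot row for index i.
vocab = {"the": 0, "quick": 1, "brown": 2, "[MASK]": 3, "fox": 4}
token_ids = [vocab[t] for t in ["the", "quick", "brown", "[MASK]"]]
T = np.eye(len(vocab))[token_ids]
print(T.shape)  # (4, 5) here; (k, |V|) in general, with k capped at, e.g., 512
```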

NB: Many things that we consider to be “words” in languages like English will not have a corresponding index in the vocabulary, because the vocabularies of these tokenizers are usually fixed. The trick for representing words that are not already in the vocabulary is to build them out of subwords in the model’s vocabulary, which also includes character-level “vocabulary” items. Subword vocabularies are usually obtained (or “learned”) using algorithms such as Byte Pair Encoding (BPE), WordPiece, or UnigramLM. Each algorithm works slightly differently; importantly, top-down, pruning-based algorithms like UnigramLM tend to learn more morphologically faithful subwords than bottom-up, merge-based algorithms like BPE.
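As a toy illustration only (the subword vocabulary below is made up; real subword vocabularies are learned from data), here is a WordPiece-style greedy longest-match segmentation of an out-of-vocabulary word:

```python
# A toy sketch of splitting an out-of-vocabulary word into subwords via greedy
# longest-match-first (WordPiece-style inference) over a made-up subword
# vocabulary; "##" marks a piece that continues a word.
subword_vocab = {"un", "##believ", "##able", "[UNK]"}

def wordpiece_segment(word: str, vocab: set) -> list:
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                pieces.append(piece)
                break
            end -= 1
        if end == start:          # no piece matched: fall back to [UNK]
            return ["[UNK]"]
        start = end
    return pieces

print(wordpiece_segment("unbelievable", subword_vocab))  # ['un', '##believ', '##able']
```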

Converting token representations to embeddings

$\mathbf{T}$ is a matrix (or tensor) containing one-hot encoded representations, with shape $k \times |V|$. If we want to obtain a $d$-dimensional embedding for each word, we need some basic matrix multiplication using an embedding matrix $\mathbf{E}$ of shape $|V| \times d$. Multiplying $\mathbf{T}$ by $\mathbf{E}$ yields a matrix of shape $k \times d$, which gives us an embedding for each token in the input. We can call the resulting matrix $\mathbf{Z}$.
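A sketch of this multiplication with made-up toy dimensions (in practice, frameworks index into $\mathbf{E}$ directly rather than materializing the full one-hot matrix multiplication):

```python
import numpy as np

# T (k x |V|) times E (|V| x d) yields Z (k x d): one d-dimensional embedding
# per input token. All sizes here are toy values chosen for illustration.
k, V, d = 4, 5, 8
rng = np.random.default_rng(0)
T = np.eye(V)[np.arange(k)]        # one-hot rows for the k input tokens
E = rng.normal(size=(V, d))        # embedding matrix (here random, i.e., untrained)
Z = T @ E
print(Z.shape)                     # (4, 8), i.e., (k, d)
```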

NB: The representations at this input stage contain no contextual information, because the embedding obtained from $\mathbf{T} \times \mathbf{E}$ for a given token $t_i$ knows nothing about the other tokens in the input. The embedding matrix $\mathbf{E}$ can be initialized with static word vector representations produced by algorithms like word2vec or GloVe (trained over a subword vocabulary on less data than the full model), which is sometimes believed to reduce training time. In practice, however, models like BERT simply treat $\mathbf{E}$ like any other parameters and learn it jointly with the rest of the model; when training from scratch, the embeddings in $\mathbf{E}$ are randomly initialized.

Converting embeddings into position-sensitive embeddings

Now that we have a $k \times d$ matrix containing an embedding for each token in our input, we must perform an additional trick to make the embeddings position-sensitive. For this, we will use position encoding representations. Position encodings have a number of advantages. For example, they allow us to distinguish between different uses of the same word (e.g., “the”) in the same sentence, and to capture possibly more important relationships that the model can leverage later, such as whether a capitalized word at the beginning of a sentence is capitalized because of its position or because it is a named entity of some kind.

This matrix $\mathbf{P}$ is also a $k \times d$ matrix, and it is summed with $\mathbf{Z}$. In the original Transformer, position encodings follow a sinusoidal pattern that gives each position a distinct signature, letting the model tell whether a token sits relatively near the beginning or the end of the input; BERT instead learns its position embeddings. The resulting matrix $\mathbf{Z} + \mathbf{P}$ is then used as the first input to the Transformer language model.
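A sketch of the sinusoidal position encodings from the original Transformer (Vaswani et al., 2017), using toy sizes; BERT learns $\mathbf{P}$ instead, but its shape and use (summed with $\mathbf{Z}$) are the same:

```python
import numpy as np

# Sinusoidal position encodings: even dimensions use sin, odd dimensions use
# cos, with wavelengths that grow geometrically across the d dimensions, so
# every position 0..k-1 gets a distinct pattern.
def sinusoidal_position_encoding(k: int, d: int) -> np.ndarray:
    positions = np.arange(k)[:, None]                        # (k, 1)
    dims = np.arange(d)[None, :]                             # (1, d)
    angle_rates = 1.0 / np.power(10000, (2 * (dims // 2)) / d)
    angles = positions * angle_rates                         # (k, d)
    P = np.zeros((k, d))
    P[:, 0::2] = np.sin(angles[:, 0::2])
    P[:, 1::2] = np.cos(angles[:, 1::2])
    return P

P = sinusoidal_position_encoding(k=4, d=8)
print(P.shape)  # (4, 8); the first Transformer layer then receives Z + P
```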

Attention mechanism

Before we get to the next layer, we must compute the attention scores, which will produce our hidden states ($\mathbf{O}$). For this, we need some additional matrices: $\mathbf{Q}$ (Query), $\mathbf{K}$ (Key), and $\mathbf{V}$ (Value). In practice, most models compute multiple attention “heads” at each layer, and each head has access to the representations of every other token in the sequence (in BERT, attention is bidirectional). Additionally, the matrix multiplications involved in the QKV operations allow us to encode an entire sequence simultaneously, which is more efficient than how recurrent neural networks (RNNs) or long short-term memory models (LSTMs) make predictions, because it enables large-scale parallelization.
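A minimal sketch of single-head scaled dot-product attention with toy dimensions and random (untrained) weights; multi-head attention runs several of these in parallel and concatenates the results:

```python
import numpy as np

def scaled_dot_product_attention(X, W_q, W_k, W_v):
    """Single-head self-attention over an input X of shape (k, d)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v             # project inputs to Q, K, V
    scores = Q @ K.T / np.sqrt(K.shape[-1])         # (k, k) similarity scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over positions
    return weights @ V                              # (k, d_head) hidden states O

rng = np.random.default_rng(0)
k, d, d_head = 4, 8, 8
X = rng.normal(size=(k, d))                         # stands in for Z + P above
W_q, W_k, W_v = (rng.normal(size=(d, d_head)) for _ in range(3))
O = scaled_dot_product_attention(X, W_q, W_k, W_v)
print(O.shape)                                      # (4, 8)
```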

We will continue the discussion of attention next week. Stay tuned for ✏️Week 3 lecture notes!